T6 M0: Technical plan + analysis notebook for multi-objective vector …#61
carlosrod723 wants to merge 20 commits into AgentOpt:experimental
Conversation
…, evaluate_vector, BasicSearch integration, 59 tests
…k, add weight-sensitivity demo
docs/T6_technical_plan.md
Outdated
| """ | ||
| score, _ = self.get_feedback(query, response, reference, **kwargs) | ||
| if isinstance(score, dict): | ||
| return float(np.mean(list(score.values()))) |
We should leave this behavior configurable from the Objective side; it should not be hard-coded here.
Also, why do we need this method on the Guide to begin with? I guess the question is whether we would require passing an Objective into the Guide.
Or, asked differently, should the Guide be the one that creates the Objective and sends it around? @allenanie what do you think?
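One way to read "configurable from the Objective side" is the hedged sketch below; the `Objective` class and its `aggregate` hook are illustrative names for this thread, not the repo's actual API, and the mean default stands in for the hard-coded `np.mean` above:

```python
from typing import Callable, Dict, Union


class Objective:
    """Hypothetical owner of the dict -> scalar policy, keeping the Guide policy-free."""

    def __init__(self, aggregate: Callable[[Dict[str, float]], float] = None):
        # Default policy: unweighted mean, but any callable can be injected.
        self.aggregate = aggregate or (lambda d: sum(d.values()) / len(d))

    def scalarize(self, score: Union[float, Dict[str, float]]) -> float:
        # Only dicts need aggregation; scalars pass through unchanged.
        if isinstance(score, dict):
            return self.aggregate(score)
        return float(score)


obj = Objective()
print(obj.scalarize({"accuracy": 0.8, "brevity": 0.4}))  # unweighted mean

weighted = Objective(aggregate=lambda d: 0.9 * d["accuracy"] + 0.1 * d["brevity"])
print(weighted.scalarize({"accuracy": 0.8, "brevity": 0.4}))
```

With this shape, the Guide's helper would call `objective.scalarize(score)` instead of baking in a mean.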
| """ | ||
| ... | ||
|
|
||
| def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]: |
As above, the logic should be implemented by the Objective.
docs/T6_technical_plan.md
Outdated
Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**:

```
normalize_score() → scalar ↔ dict conversion
```
Let's use a different name: `normalize_score` implies some sort of scaling or shifting is done.
Let's use something explicit and more neutral, like `to_score_dict`.
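A minimal sketch of the suggested rename, assuming it stays a pure function: no scaling or shifting, just a neutral scalar → dict wrap (the `"score"` key is an assumption for illustration):

```python
from typing import Dict, Union


def to_score_dict(score: Union[float, Dict[str, float]],
                  key: str = "score") -> Dict[str, float]:
    """Wrap a scalar as a single-entry dict; pass dicts through unchanged."""
    if isinstance(score, dict):
        return {k: float(v) for k, v in score.items()}
    return {key: float(score)}


print(to_score_dict(0.75))          # {'score': 0.75}
print(to_score_dict({"acc": 1.0}))  # {'acc': 1.0}
```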
Hi @chinganc, I propose to address your comments by moving all dict → scalar conversion and aggregation policy into ObjectiveConfig.

This keeps the Guide responsible for producing raw metrics, and keeps ObjectiveConfig (trainer-side) responsible for aggregation/scalarization/selection, without passing ObjectiveConfig into the Guide.
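As a rough sketch of that split (field names and the weighted-sum policy below are assumptions for illustration, not the merged API): the Guide emits a raw metric dict, and a trainer-side config object owns scalarization.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ObjectiveConfig:
    # Optional per-metric weights; an empty dict means "no policy configured".
    weights: Dict[str, float] = field(default_factory=dict)

    def scalarize(self, scores: Dict[str, float]) -> float:
        """Collapse a raw metric dict (produced by the Guide) to one scalar."""
        if not self.weights:
            # Fallback policy: unweighted mean over all metrics.
            return sum(scores.values()) / len(scores)
        return sum(self.weights.get(k, 0.0) * v for k, v in scores.items())


cfg = ObjectiveConfig(weights={"accuracy": 0.7, "cost": 0.3})
raw = {"accuracy": 0.9, "cost": 0.2}  # the Guide emits this, policy-free
print(cfg.scalarize(raw))             # weighted sum, ≈ 0.69
```

The Guide never sees `ObjectiveConfig`; only trainer code calls `scalarize`.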
…larize_dict, aggregate to objectives.py
…ctive_convex_fn.py
…ch + 12 integration tests
…ation, better plots
…zon exhaustion during multi-candidate validation
…dcoded 0.0 when no test_dataset
@doxav In Guide, I think
Good point @chinganc. Here's a proposal: add

`def get_score_dict(self, query, response, reference=None, **kwargs)`

This means no breaking changes: existing guides that return `(float, str)` still work, and guides that want multi-objective just return `(dict, str)` from `get_feedback()`. `metric()` would also need to handle dict scores from `get_feedback()`; when it gets a dict, it can use the `scalarize_dict` policy from ObjectiveConfig, or fall back to the mean. Want me to implement this?
How about this? Have the user implement a new `_get_feedback` instead, and let the new base-class `get_feedback` normalize its result.
Thanks Ching-An. I like this approach, it's cleaner than what I proposed. To confirm my understanding:

- User implements `_get_feedback(query, response, reference, **kwargs)` → `(float, str)` or `(Dict[str, float], str)`
- Base `get_feedback()` calls `_get_feedback()` and normalizes: if float, it wraps it as `{"score": float}` → always returns `(dict, str)`
- `get_score_dict()` indexes into the dict (e.g., selects an objective based on config)
- `metric()` scalarizes from `get_score_dict()`

This means trainers can always assume a dict from `get_feedback()`; there's no branching on type. I'll update the PR to follow this pattern. One edge case to flag: our TokenUsageAugmentingGuide for GSM8K augments scores with token metrics not present in `_get_feedback()`. I'll handle that as an override of `get_feedback()` that merges the extra metrics after calling `super()`.
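The agreed pattern can be sketched as follows, assuming a simplified `Guide` base class (the real opto Guide has more machinery than shown here, and `ExactMatchGuide`/`TokenAwareGuide` are illustrative stand-ins):

```python
from typing import Dict, Tuple, Union

Score = Union[float, Dict[str, float]]


class Guide:
    def _get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[Score, str]:
        # Users implement this; the score may be a float or a metric dict.
        raise NotImplementedError

    def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[Dict[str, float], str]:
        # Normalize once here so trainers never branch on type.
        score, feedback = self._get_feedback(query, response, reference, **kwargs)
        if not isinstance(score, dict):
            score = {"score": float(score)}
        return score, feedback


class ExactMatchGuide(Guide):
    def _get_feedback(self, query, response, reference=None, **kwargs):
        return float(response == reference), "exact match check"


class TokenAwareGuide(ExactMatchGuide):
    # Edge case from the thread: merge extra metrics after calling super().
    def get_feedback(self, query, response, reference=None, **kwargs):
        scores, feedback = super().get_feedback(query, response, reference, **kwargs)
        scores["tokens"] = float(len(response.split()))
        return scores, feedback


print(ExactMatchGuide().get_feedback("q", "a", "a")[0])    # {'score': 1.0}
print(TokenAwareGuide().get_feedback("q", "a b", "x")[0])  # {'score': 0.0, 'tokens': 2.0}
```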
Sounds good
M0 delivery for T6 Multi-Objective Vector Scores.
Deliverables:
- `docs/T6_technical_plan.md` — refined tech plan with API signatures, edge cases, test plan
- `examples/notebooks/t6_m0_analysis.ipynb` — Colab notebook (no API keys needed)

The notebook demonstrates current baseline behavior and a working prototype of weighted vs Pareto selection with deterministic tie-break validation.
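An illustrative sketch (not the notebook's actual code) of the two selection modes it compares, with the deterministic tie-break done by lowest candidate index:

```python
from typing import Dict, List


def pareto_front(candidates: List[Dict[str, float]]) -> List[int]:
    """Indices of non-dominated candidates (higher is better on every key)."""
    front = []
    for i, a in enumerate(candidates):
        dominated = any(
            all(b[k] >= a[k] for k in a) and any(b[k] > a[k] for k in a)
            for j, b in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append(i)
    return front


def select_weighted(candidates: List[Dict[str, float]],
                    weights: Dict[str, float]) -> int:
    """Argmax of the weighted sum; ties broken by lowest index (deterministic)."""
    scored = [sum(weights[k] * c[k] for k in weights) for c in candidates]
    return max(range(len(candidates)), key=lambda i: (scored[i], -i))


cands = [
    {"accuracy": 0.9, "brevity": 0.1},
    {"accuracy": 0.5, "brevity": 0.5},
    {"accuracy": 0.4, "brevity": 0.4},  # dominated by the candidate above
]
print(pareto_front(cands))                                   # [0, 1]
print(select_weighted(cands, {"accuracy": 1.0, "brevity": 1.0}))  # tie -> lowest index
```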